Getting started#
The xdatasets library enables users to effortlessly access a vast collection of earth observation datasets that are compatible with xarray formats.
The library adopts an opinionated approach to data querying and caters to the specific needs of certain user groups, such as hydrologists, climate scientists, and engineers. One of the functionalities of xdatasets is the ability to extract data at a specific location or within a designated region, such as a watershed or municipality, while also enabling spatial and temporal operations.
To use xdatasets, users must employ a query. A straightforward query to extract the variables t2m (2m temperature) and tp (Total precipitation) from the era5_reanalysis_single_levels dataset at two geographical positions (Montreal and Toronto) could be as follows:
query = {
"variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
"space": {
"clip": "point", # bbox, point or polygon
"geometry": {'Montreal' : (45.508888, -73.561668),
'Toronto' : (43.651070, -79.347015)
}
}
}
An example of a more complex query would look like this. This query calls the same variables as above. However, instead of specifying geographical positions, GeoPandas.DataFrame is used to provide features (such as shapefiles or geojson) for extracting data within each of them. Each polygon is identified using the unique identifier Station, and a spatial average is computed within each one (aggregation: True). The dataset, initially at an hourly time step, is converted into a daily time
step while applying one or more temporal aggregations for each variable as prescribed in the query. The xdatasets function ultimately returns the dataset for the specified date range and time zone.
query = {
"variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
"space": {
"clip": "polygon", # bbox, point or polygon
"aggregation": True,
"geometry": gdf,
"unique_id": "Station"
},
"time": {
"timestep": "D",
"aggregation": {"tp": np.nansum,
"t2m": [np.nanmax, np.nanmin]},
"start": '1959-01-01',
"end": '1961-05-31',
"timezone": 'America/Montreal',
},
}
Note
Don’t worry! Below, you’ll find additional examples that will assist in understanding each parameter in the query, as well as the possible combinations.
Query climate datasets#
In order to utilize xdatasets, you must import at least xdatasets, pandas, geopandas, and numpy. Additionally, we import pathlib to interact with files.
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import xdatasets as xd
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
Let’s first access certain polygon features, which can be in the form of shapefiles, geojson, or any other format compatible with geopandas. In this example, we are using JSON (geojson) files.
[3]:
bucket = Path('https://s3.us-east-2.wasabisys.com/watersheds-polygons/MELCC/json')
paths = [bucket.joinpath('023003/023003.json'),
bucket.joinpath('031101/031101.json'),
bucket.joinpath('040111/040111.json')]
Subsequently, all of the files can be opened and consolidated into a single geopandas.GeoDataFrame object.
[4]:
gdf = pd.concat([gpd.read_file(path) for path in paths]).reset_index(drop=True)
gdf
[4]:
| Station | Superficie | geometry | |
|---|---|---|---|
| 0 | 023003 | 208.4591919813271 | POLYGON ((-70.82601 46.81658, -70.82728 46.815... |
| 1 | 031101 | 111.7131058782722 | POLYGON ((-73.98519 45.21072, -73.98795 45.209... |
| 2 | 040111 | 433.440893903503 | POLYGON ((-74.06645 46.02253, -74.06647 46.022... |
Let’s examine the geographic locations of the polygon features.
[5]:
gdf.hvplot(geo=True,
tiles='ESRI',
color='Station',
alpha=0.8,
width=750,
height=450,
legend='top',
hover_cols=['Station','Superficie'])
[5]:
[6]:
%%time
query = {
"variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
"space": {
"clip": "polygon", # bbox, point or polygon
"aggregation": False,
"geometry": gdf,
"unique_id": "Station"
},
"time": {
"start": '1979-01-01',
"end": '2020-12-31',
},
}
xds = xd.Dataset(**query)
3it [00:06, 2.13s/it]
CPU times: user 11.6 s, sys: 748 ms, total: 12.4 s
Wall time: 18.2 s
[7]:
xds.data
[7]:
<xarray.Dataset>
Dimensions: (latitude: 6, longitude: 5, time: 368184, Station: 3)
Coordinates:
* latitude (latitude) float32 45.0 45.25 46.0 46.25 46.5 46.75
* longitude (longitude) float32 -74.5 -74.25 -74.0 -71.0 -70.75
* time (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
* Station (Station) object '023003' '031101' '040111'
Superficie (Station) object '208.4591919813271' ... '433.440893903503'
Data variables:
t2m (Station, time, latitude, longitude) float32 nan nan ... nan nan
tp (Station, time, latitude, longitude) float32 nan nan ... nan nan
weights (Station, latitude, longitude) float64 nan nan nan ... nan nan
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts[8]:
station = '023003'
ds_station = xds.data.sel(Station=station)
ds_clipped = xds.bbox_clip(ds_station)
ds_clipped
[8]:
<xarray.Dataset>
Dimensions: (time: 368184, latitude: 2, longitude: 2)
Coordinates:
* latitude (latitude) float32 46.5 46.75
* longitude (longitude) float32 -71.0 -70.75
* time (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Station <U6 '023003'
Superficie object '208.4591919813271'
Data variables:
t2m (time, latitude, longitude) float32 270.8 270.3 ... 269.3 268.8
tp (time, latitude, longitude) float32 nan nan nan ... 0.0 0.0 0.0
weights (latitude, longitude) float64 7.449e-06 0.0001531 0.9079 0.09194
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts[9]:
((ds_clipped \
.t2m \
.isel(time=0) \
.hvplot(tiles='ESRI',
geo=True,
width=750,
height=450) * \
gdf[gdf.Station == station] \
.hvplot(geo=True,
color='Station',
alpha=0.8,
width=750,
height=450,
legend='top',
hover_cols=['Station','Superficie'])) + \
ds_clipped \
.weights \
.hvplot(tiles='ESRI',
geo=True,
width=750,
height=450) * \
gdf[gdf.Station == station] \
.hvplot(geo=True,
color='Station',
alpha=0.8,
width=750,
height=450,
legend='top',
hover_cols=['Station','Superficie'])).cols(1)
[9]:
[ ]:
[10]:
%%time
# http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
query = {
"variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
"space": {
"clip": "polygon", # bbox, point or polygon
"aggregation": True,
"geometry": gdf,
"unique_id": "Station"
},
"time": {
"timestep": "D",
"aggregation": {"tp": [np.nansum],
"t2m": [np.nanmax, np.nanmin]},
"start": '1979-01-01',
"end": '2020-12-31',
"timezone": 'America/Montreal',
},
}
xds = xd.Dataset(**query)
3it [00:02, 1.07it/s]
Processing tp: 100%|██████████| 2/2 [00:07<00:00, 3.62s/it]
CPU times: user 14.3 s, sys: 892 ms, total: 15.2 s
Wall time: 19.5 s
[11]:
xds.data
[11]:
<xarray.Dataset>
Dimensions: (Station: 3, time: 15342)
Coordinates:
longitude (Station) float64 -70.94 -74.14 -74.27
latitude (Station) float64 46.72 45.18 46.08
* Station (Station) object '023003' '031101' '040111'
Superficie (Station) object '208.4591919813271' ... '433.440893903503'
* time (time) datetime64[ns] 1978-12-31 1979-01-01 ... 2020-12-31
Data variables:
t2m_nanmax (time, Station) float32 271.1 276.4 271.5 ... 273.7 276.1 272.7
t2m_nanmin (time, Station) float32 268.8 274.4 269.7 ... 269.0 271.3 266.3
tp_nansum (time, Station) float32 0.0 0.0 0.0 ... 0.002386 0.002362
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts
timezone: America/Montreal[12]:
xds.data.to_dataframe().hvplot(x='time',
y=['t2m_nanmax','t2m_nanmin'],
grid=True,
width=750,
height=450,
groupby='Station',
legend='top',
widget_location='bottom')
[12]:
[13]:
xds.data.to_dataframe().hvplot(x='time',
y=['tp_nansum'],
grid=True,
width=750,
height=450,
groupby='Station',
legend='top',
widget_location='bottom')
[13]:
[ ]: